Towards interpretable Cryo-EM: disentangling latent spaces of molecular conformations

Molecules are essential building blocks of life and their different conformations (i.e., shapes) crucially determine the functional role that they play in living organisms. Cryogenic Electron Microscopy (cryo-EM) allows for acquisition of large image datasets of individual molecules. Recent advances in computational cryo-EM have made it possible to learn latent variable models of conformation landscapes. However, interpreting these latent spaces remains a challenge as their individual dimensions are often arbitrary. The key message of our work is that this interpretation challenge can be viewed as an Independent Component Analysis (ICA) problem where we seek models that have the property of identifiability. That means, they have an essentially unique solution, representing a conformational latent space that separates the different degrees of freedom a molecule is equipped with in nature. Thus, we aim to advance the computational field of cryo-EM beyond visualizations as we connect it with the theoretical framework of (nonlinear) ICA and discuss the need for identifiable models, improved metrics, and benchmarks. Moving forward, we propose future directions for enhancing the disentanglement of latent spaces in cryo-EM, refining evaluation metrics and exploring techniques that leverage physics-based decoders of biomolecular systems. Moreover, we discuss how future technological developments in time-resolved single particle imaging may enable the application of nonlinear ICA models that can discover the true conformation changes of molecules in nature. The pursuit of interpretable conformational latent spaces will empower researchers to unravel complex biological processes and facilitate targeted interventions. This has significant implications for drug discovery and structural biology more broadly. More generally, latent variable models are deployed widely across many scientific disciplines. 
Thus, the argument we present in this work has much broader applications in AI for science if we want to move from impressive nonlinear neural network models to mathematically grounded methods that can help us learn something new about nature.


Introduction
Molecules such as proteins or nucleic acids make up the building blocks of life. Living organisms contain a plethora of molecules that often comprise thousands of atoms. Biomolecules change their conformation (i.e., shape) to fulfill important biological functions such as enzymatic reactions or cellular communication. Understanding the conformational heterogeneity of biomolecules is crucial for deciphering their functional mechanisms and designing targeted interventions. Cryo-Electron Microscopy (cryo-EM) has emerged as a powerful technique for visualizing molecular structures at high resolution. Recent advancements in computational cryo-EM have demonstrated the potential of latent variable models to capture the diverse conformations adopted by biomolecules (reviewed in Donnat et al., 2022). However, interpreting these learned latent spaces and extracting biologically meaningful information from them remains a significant challenge.
In this paper, we propose a fruitful approach to unravel the complexities of conformational latent spaces in cryo-EM by framing this as an Independent Component Analysis (ICA) problem. In its original linear formulation, the high-level goal of ICA is to discover linear projections of the data that are as statistically independent as possible (Hyvärinen and Oja, 2000). A common observation in ICA applications is that these linear projections discover underlying factors of variation in the data that give insight into the underlying processes. A few prior works have tested the application of linear ICA to molecular imaging (Borek et al., 2018; Gao et al., 2020), finding more meaningful separation of molecular conformation changes. However, the transformation between meaningful factors and the data is inherently nonlinear in cryo-EM. Therefore, we need theory and models that work for the nonlinear models used in modern cryo-EM (e.g., Zhong et al., 2021a). Building on recent theoretical work in identifiable nonlinear ICA (Hyvärinen et al., 2024) and on disentanglement models and their benchmarks in machine learning (Locatello et al., 2019), we suggest a path to bridging the gap between theoretical advancements and practical applications in cryo-EM research. We argue that nonlinear ICA methods have the potential to provide a powerful framework for disentangling the latent representations of biomolecular conformations from cryo-EM datasets, overcoming the limitations of traditional volume visualization approaches and ultimately allowing us to delve deeper into the structural dynamics of biomolecules. Moreover, we argue that the establishment of benchmarks and metrics specific to cryo-EM disentanglement models is of paramount importance. Adapting and extending existing benchmarks from the machine learning field should allow us to objectively evaluate the performance of different disentanglement approaches and track progress in the development of interpretable cryo-EM methods. Ultimately, this interdisciplinary approach will enhance our understanding of complex biological processes and open up new avenues for therapeutic interventions and drug discovery.
Figure 1. Overview. What does it mean to have a disentangled representation of molecular conformations? (A) The example (left) shows a simple molecule with two degrees of freedom, 1. and 2., for changing its conformation. An entangled model (right, top) represents mixtures of both movements on each of its latent dimensions, $z_1 \sim$ 1. + 2. and $z_2 \sim$ 1. − 2. A disentangled model (right, bottom) represents pure movements on each of its latent dimensions, $z_1 \sim$ 1. and $z_2 \sim$ 2.; actually, $z_2 = -$2., but the sign flip, incorporated in $\sim$, does not compromise interpretability. Note that disentangling conformations from cryo-EM measurements requires additional information (e.g., time, temperature or physics), as discussed in Section 4. (B) Training a VAE with separate pose $\phi$ and conformation $z$ latent spaces on cryo-EM particle images, without any intervention. (C) Interpreting the learned latent space of a model. An axis traversal (blue) results in a complex motion of both arms, i.e., fails at disentangling the two degrees of freedom. A simple transformation, moving only the left arm, corresponds to a curved trajectory.
Figure 2. Application of ICA to cryo-EM data (reproduced with permission from Gao et al., 2020). Original caption: (A) Principal component analysis of the 18 multi-body parameters refined for each particle image yields 18 principal components (PC), displayed here in decreasing order of explained variance. The first three components explain more than 30% of the variability in the particle images. (B) (left) Definition of the multi-body segmentation: the central PDE6 stalk in blue corresponds to Body 1, while the two GαT·GTP subunits correspond to Bodies 2 and 3. (right) The motion of each body is parameterized with three translational parameters and three rotational parameters. Each of the 18 principal and three independent components is a linear combination of the resulting 18 rigid-body parameters, and their weights are shown here for the first three principal components (from negligible to larger weight as the shade of grey becomes darker).
The paper is structured as follows. We first provide a general background on the cryo-EM computational problem (Section 2) and how it can be framed as an ICA problem (Section 2.1). We then discuss the two fundamental challenges associated with cryo-EM: firstly, separating the conformation and the pose representations (Section 2.2) and, secondly, finding the right (disentangled) representation of conformations (Section 2.3). We then go into more detail on both challenges by providing quantitative metrics to measure progress and modeling suggestions to improve current frameworks. To disentangle poses and shapes, we propose intervention-based metrics (Section 3.1) and training schemes (Section 3.2) that can be added to existing models. For the larger problem of disentangling conformation representations (Section 4.1), we discuss existing disentanglement benchmarks and metrics (Locatello et al., 2019). We then discuss three potential approaches for solving this problem (Section 4.2), based on temporal information (Section 4.2.1), temperature control (Section 4.2.2) and atomic models (Section 4.2.3). Finally, we discuss the path forward and the broader implications of this framework for taking computational approaches from neural-network-based curve fitting to actual understanding of the mechanisms in nature.

Interpretable heterogeneous reconstruction: a disentanglement problem
Heterogeneous cryo-EM reconstruction methods aim to model the different conformations that a molecule may assume (Donnat et al., 2022). For instance, we can think of a molecule with a fixed central structure and two adjustable "arms" (see Figure 1). Clearly, any conformation that this molecule may assume can be described by providing the position of both arms. Thus, these independently moving parts may be thought of as the fundamental degrees of freedom of this molecule's conformations.
We can parameterize them by a two-dimensional latent variable $z \in \mathcal{Z}$, where $\mathcal{Z} := \mathbb{R}^2$ is some degree of movement. The volume that the molecule occupies in three-dimensional space can be thought of as a function $v \in \mathcal{V}$, $v: \mathbb{R}^3 \times \mathcal{Z} \to \{0, 1\}$, that is parameterized by $z$ and indicates for any position in space ($\mathbb{R}^3$) whether it is part of the molecular volume or not, known as an implicit representation of the volume (Sitzmann et al., 2020; Donnat et al., 2022). That means, for different values of $z$, $v_z = v(\cdot, z)$ would describe a different volume. Crucially, this function is not known and it is a central goal in heterogeneous cryo-EM reconstruction to learn and study it. For instance, one approach would be to train a neural network ($v_\theta$) to approximate the true volume function, $v_\theta \approx v^*$ (Zhong et al., 2021a).
Furthermore, in cryo-EM we typically see a projection $\pi: \mathcal{V} \times \Phi \to \mathbb{R}^N$ to a gray-scale pixel image (represented, to keep notation uncluttered, as a vector with $N$ entries). This projection depends on the pose parameters $\phi \in \Phi = SO(3)$, so we will also refer to $\pi$ as the pose function (Table 1). The pose parameters may also need to be inferred (typically, the cryo-EM image formation model would also include camera parameters such as the microscope defocus; we are skipping those for simplicity). That means, for different values of $\phi$, $\pi_\phi = \pi(\cdot, \phi)$ would describe a different projection. Importantly, the function $\pi$ does not have to be learned because we know the physics, i.e., the optics behind this projection; thus, we know that the projection in our model $\pi_\phi$ must be the same as the ground truth projection, $\pi^*_\phi = \pi_\phi$.
Putting this together we can write the combined cryo-EM generative model (i.e., the abstract process that yields the data we observe). That is, the observed data $x$ is modeled as being generated by the ground truth model as
$$x = g^*(z^*, \phi^*) = \pi_{\phi^*}\!\left(v^*_{z^*}\right), \tag{1}$$
which is, crucially, a function of the ground truth latent quantities $(z^*, \phi^*)$. Usually, we would be measuring very noisy signals $x = g^*(z^*, \phi^*) + \epsilon$, where the noise can be modeled as additive Gaussian $\epsilon \sim \mathcal{N}(0, \sigma^2)$ in image space. The essential problem of heterogeneous reconstruction in computational cryo-EM can now be stated as follows.
This would be the true volume function $v^*$ that shows us how the independent degrees of freedom change the molecule's conformation. Many cryo-EM models actually learn a probabilistic representation $p(x \mid z, \phi)$ of the observed data $x$ conditioned on the latent variables. Thus, it becomes necessary to perform inference, such as maximum a posteriori estimation of the latents conditioned on some observed data, $p(z, \phi \mid x)$. Alternatively, it is common to approximate the posterior distribution itself with an amortized variational method such as a variational autoencoder (VAE) (Kingma and Welling, 2013). In this work we will be agnostic about the inference procedure (maximum a posteriori probability (MAP), or mean of the amortized variational posterior) and just assume that there exists a mapping $f: \mathcal{X} \to \mathcal{Z}$ from data to latent variables.
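To make the generative model of Eq. 1 concrete, the following is a minimal toy sketch in NumPy, not the image formation model of any actual cryo-EM package. The two-armed pseudo-atom molecule (`toy_conformation`), the Gaussian blob rendering, the grid size, and all function names are illustrative assumptions; camera effects such as defocus are ignored, as in the text.

```python
import numpy as np

def toy_conformation(z):
    """Hypothetical two-armed molecule: 3D pseudo-atom coordinates for z = (z1, z2)."""
    body = np.array([[0.0, -0.5, 0.0], [0.0, 0.0, 0.0], [0.0, 0.5, 0.0]])
    left = np.array([-np.cos(z[0]), 0.5 + np.sin(z[0]), 0.0])   # left arm angle z1
    right = np.array([np.cos(z[1]), 0.5 + np.sin(z[1]), 0.0])   # right arm angle z2
    return np.vstack([body, left, right])

def project(atoms, rotation, n_pix=32, width=0.15):
    """Pose function: rotate the atoms, drop the optical axis, render Gaussian blobs."""
    xy = (atoms @ rotation.T)[:, :2]
    grid = np.linspace(-2.0, 2.0, n_pix)
    gx, gy = np.meshgrid(grid, grid, indexing="xy")
    image = np.zeros((n_pix, n_pix))
    for x0, y0 in xy:
        image += np.exp(-((gx - x0) ** 2 + (gy - y0) ** 2) / (2 * width ** 2))
    return image

def random_rotation(rng):
    """Random 3D rotation matrix from the QR decomposition of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q = q * np.sign(np.diag(r))
    if np.linalg.det(q) < 0:          # ensure a proper rotation (det = +1)
        q[:, 0] = -q[:, 0]
    return q

def simulate(z, rotation, sigma, rng):
    """Noisy observation x = g(z, phi) + eps with additive Gaussian pixel noise."""
    clean = project(toy_conformation(z), rotation)
    return clean + rng.normal(scale=sigma, size=clean.shape)
```

Calling `simulate(np.array([0.3, -0.2]), random_rotation(rng), 0.1, rng)` with `rng = np.random.default_rng(0)` yields one noisy 32 × 32 particle image. A simulator of this kind gives full access to the ground truth latents, which is what the benchmarks discussed later rely on.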

Heterogeneous reconstruction in Cryo-EM is an ICA problem
Let us compare this to a standard independent component analysis (ICA) setting (Comon, 1994). In ICA we assume that there are $K > 1$ independent variables collected in the random vector $z = (z_1, \dots, z_K)$. As an example to illustrate this, we may think of a public space where $K$ different speakers proclaim their prophecies $z_i$, completely independently of one another, $p(z_i, z_j) = p(z_i)\,p(z_j)$, $\forall i \neq j$. However, we do not observe those $z$ directly. Instead, we observe $K$ linear combinations of those variables, $x = Az + \epsilon$, with $A \in \mathbb{R}^{K \times K}$ some unknown full-rank matrix and, again, with additive Gaussian noise $\epsilon \sim \mathcal{N}(0, \sigma^2)$. In our example, this may correspond to $K$ microphones placed in the space, each recording some linear combination $A_i^T z$ of the speech signals. This scenario is also called blind source separation, the term "blind" referring to the idea that we know almost nothing about the "sources" $z_i$, apart from some general statistical properties. In linear ICA, the function $g: \mathcal{Z} \to \mathcal{X}$, $g(z) = Az$, that maps from sources $z$ to observations $x$ is called the mixing function. This basic case of linear ICA has been well studied in the machine learning and signal processing literature (Hyvärinen and Oja, 2000). Briefly, under the simple assumption that at most one of the sources $z_i$ follows a Gaussian distribution, we can find an unmixing function $f: \mathcal{X} \to \mathcal{Z}$ that approximately inverts the mixing function, in practice up to $(f \circ g)(z) \sim_C z$, i.e., some simple equivalence class $\sim_C$ such as permutations and scalings.
If $g^*(z^*, \phi^*)$ in Eq. 1 were linear in $z$ and $\phi$, then the cryo-EM reconstruction problem would amount to a simple linear ICA problem with the (extended) sources $(z, \phi)$, i.e., the concatenation of $z$ and $\phi$. Unfortunately, the cryo-EM mixing function $g(z, \phi)$ in Eq. 1 is nonlinear. This can be easily appreciated, e.g., by noting that scaling a latent to $\alpha z$ does not produce an equally scaled image, $g(\alpha z, \phi) \neq \alpha g(z, \phi)$; the latter would be just the same image but changed in brightness. In the case of a nonlinear mixing function $g$, Hyvärinen and Pajunen (1999) showed that it is possible to construct many functions $f: \mathcal{X} \to \mathcal{Z}$ that turn the data into independent variables. However, most of these independent variables have no intelligible relationship with the true sources $z$. This problem is called the lack of identifiability of the model, which in general mathematical terms means lack of uniqueness of the solution.
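The linear setting described above can be illustrated with a small, self-contained sketch. It is not the algorithm of any cited paper: for two sources it simply whitens the observations and then searches over rotation angles for the one that maximizes non-Gaussianity (absolute excess kurtosis). The uniform source distribution, the mixing matrix `A`, the grid of angles, and the omission of the noise term are all assumptions made for illustration.

```python
import numpy as np

def whiten(x):
    """Center and whiten observations x of shape (n_samples, 2)."""
    x = x - x.mean(axis=0)
    eigval, eigvec = np.linalg.eigh(np.cov(x, rowvar=False))
    return x @ eigvec / np.sqrt(eigval)

def excess_kurtosis(y):
    """Per-column excess kurtosis (zero for a Gaussian)."""
    y = (y - y.mean(axis=0)) / y.std(axis=0)
    return (y ** 4).mean(axis=0) - 3.0

def ica_2d(x, n_angles=360):
    """Two-source linear ICA: whiten, then pick the most non-Gaussian rotation."""
    xw = whiten(x)
    best, best_score = None, -np.inf
    for theta in np.linspace(0.0, np.pi / 2, n_angles, endpoint=False):
        c, s = np.cos(theta), np.sin(theta)
        y = xw @ np.array([[c, -s], [s, c]])
        score = np.abs(excess_kurtosis(y)).sum()
        if score > best_score:
            best, best_score = y, score
    return best

rng = np.random.default_rng(0)
z = rng.uniform(-1.0, 1.0, size=(5000, 2))   # independent, non-Gaussian sources
A = np.array([[1.0, 0.6], [0.4, 1.0]])       # full-rank mixing matrix (unknown to the method)
x = z @ A.T                                  # observations x = A z (noise omitted)
z_hat = ica_2d(x)

# Absolute correlations between true sources (rows) and recovered components (columns).
corr = np.abs(np.corrcoef(z.T, z_hat.T)[:2, 2:])
```

Each true source should correlate strongly with exactly one recovered component, which is the sense in which the sources are recovered up to permutation, sign, and scale. No comparable guarantee holds once the mixing function is nonlinear.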
For cryo-EM models this would mean that we can learn latent spaces whose individual dimensions have no principled relationship with the true degrees of freedom in molecular conformations. As an example, we may end up with a representation of the simple two-dimensional molecule from above where traversing any single dimension in the latent space of our model corresponds to complex combinations of the two arm movements (Figure 1). This would likely bias our interpretation of how they are articulated together to carry out their function. Thus, without further restrictions on our model, we would fail to discover the simple and elegant structure where the molecule just changes conformation along two independent degrees of freedom, i.e., left and right arm. In the modern machine learning context, finding a latent space that separates the underlying factors of variation is often called disentanglement (Bengio et al., 2013), but it has to be noted that the meaning of that term is quite vague.
Figure caption (fragment): ... (Locatello et al., 2019). In seven out of eight metrics we see that SlowVAE learns a more disentangled representation of the conformation latents than a regular VAE without any adaptations.
Fortunately, recent advances propose ways to solve this problem with nonlinear ICA (Hyvärinen et al., 2024). For example, Khemakhem et al. (2020) add conditioning ("auxiliary") variables $u$ that change the source distributions $p(z_i \mid u)$. Such a $u$ could represent extra measurements by another modality, or it could be defined by interventions. The model then becomes identifiable if $u$ modulates the distribution of $z$ strongly enough. This is possible because then the $z_i$ are conditionally independent for any $u$, which provides much stronger constraints than the mere (unconditional) independence of the $z_i$ as in the basic ICA framework. Khemakhem et al. (2020) further propose to estimate this model using variational methods, leading to an algorithm which is a variant of VAEs. An alternative approach is possible by assuming temporal dependencies of the source time series (Hyvärinen and Morioka, 2017; Klindt et al., 2020; Hälvä et al., 2021); spatial dependencies can also be used (Hälvä et al., 2024). In this case, independence of the components over time lags leads, again, to more constraints, and thus to identifiability under some conditions. A very different approach can be developed by constraining the nonlinear function $g$, parameterizing it with such a small number of parameters that identifiability is obtained (Hyvärinen et al., 2024; Section 5.4); for example, if we know the physics underlying $g$ (i.e., pose transformations and projections) we may also be able to obtain identifiability. Finally, we point out that the independence assumption can be relaxed (Träuble et al., 2021); even causal relationships between the independent components have been modeled, but this requires further constraints and assumptions (Träuble et al., 2021; Morioka and Hyvärinen, 2023; Yao et al., 2023). Any such learning is easier if interventions on the system are possible (Ahuja et al., 2023) or if it is assumed that the system undergoes sparse, discrete state changes like in robotics experiments (Locatello et al., 2020b), but the theory mentioned above is specifically unsupervised, thus not necessarily requiring interventions.
In this work, we will argue that the heterogeneous reconstruction problem in cryo-EM should be framed as a nonlinear ICA problem to help us build better and more interpretable models that separate the independent degrees of freedom with which molecules change conformation in nature. Few prior works have applied linear ICA to molecular imaging (Borek et al., 2018; Gao et al., 2020), finding more meaningful separation of molecular conformation changes (Figure 2). However, to the best of our knowledge, none of the nonlinear ICA approaches mentioned above have so far been applied to the field of computational cryo-EM. Below we will propose three promising candidate approaches for solving the nonlinear ICA problem in cryo-EM.

Disentangling pose and conformation
The first challenge in computational cryo-EM is that of separating the pose $\phi$ of a molecule from its conformation $z$. Again, looking at the cryo-EM mixing function (Eq. 1), $x = g^*(z^*, \phi^*) + \epsilon$, this means we want to find a representation that separates the estimated conformation $z$ and the pose $\phi$.² In other words, as a first step, we could find just two latent subspaces without specifying the individual components, or the bases, inside those subspaces. Again, if $g^*$ were linear, we might use the well-developed methods of independent subspace analysis, or subspace ICA, to approach this problem (Hyvärinen and Hoyer, 2000; Theis, 2006). This might also help with the pose variables' topology, which is typically not Euclidean. For instance, a circular pose variable $\phi_i \in S^1$ that lives on a circle and encodes rotations around one axis could not be represented by a single dimension in a typical latent variable model that maps to real-valued scalars, $f: \mathcal{X} \to \mathbb{R}^K$. However, some subspace variants of ICA provide exactly such a transformation into spherical coordinates (Hyvärinen et al., 2009, Ch. 10).
² In the common framework of VAEs, $f_z(x) = \mu_z(x)$ could be defined as the mean of the variational posterior; in an auto-decoding framework this could be the MAP outcome of inference, i.e., $f_z(x) = \arg\max_z p(x \mid z)\,p(z)$.
Figure caption: Alternative Conformation Disentanglement. (A) Proof-of-concept illustration of Boltzmann ICA for the use of temperature as a conditioning variable (Khemakhem et al., 2020). At low temperature $\tau_1$ (left) only the left arm of the molecule shows significant movement. At high temperature $\tau_2$ both arms show significant movement. Conditioning the prior of the conformation latent with this information may allow identifying the distinct sources. (B) Proof-of-concept illustration of Physics-based ICA using a physics-based decoder with atomic models that assign mechanistic meaning to latent variables, e.g., in terms of atomic coordinates, (potentially sparse) movement of volume (Punjani and Fleet, 2021), or local normal mode analysis (NMA) deformations (Nashed et al., 2022; Koo et al., 2023).
Frontiers in Molecular Biosciences, frontiersin.org
Many cryo-EM models use separate latent spaces to represent conformation and pose (Donnat et al., 2022). However, that does not mean that models learn, during nonlinear optimization, to actually use those separate spaces in the intended way. Recent work by Edelberg and Lederman (2023) demonstrated that this is a problem in popular cryo-EM models such as CryoDRGN (Zhong et al., 2021a). In particular, they showed that a 90° rotation of an image causes a different prediction in the space of conformation latent variables, even though those should be invariant to pose transformations (see Section 3.1 below).
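The kind of check described above can be sketched in a few lines. This is not the procedure of Edelberg and Lederman (2023), only a minimal illustration of the idea: `encode_conformation` is a hypothetical callable standing in for a trained model's conformation encoder, and the Euclidean distance between latents is an assumed choice.

```python
import numpy as np

def rotation_sensitivity(encode_conformation, images):
    """Mean distance between conformation latents of each image and its 90-degree rotation.

    `encode_conformation` is a hypothetical callable mapping a 2D image to a 1D
    latent vector. A value near zero indicates that the conformation latents are
    invariant to this in-plane rotation; larger values indicate pose leakage.
    """
    z = np.stack([encode_conformation(img) for img in images])
    z_rot = np.stack([encode_conformation(np.rot90(img)) for img in images])
    return float(np.linalg.norm(z - z_rot, axis=1).mean())
```

In-plane rotations by 90° are convenient because `np.rot90` permutes pixels exactly, so no interpolation artifacts are introduced; other pose changes would require re-rendering through the projection operator.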

Disentangling independent factors of conformations
A more fundamental challenge is that of separating the independent degrees of freedom of a molecule. Specifically, we want to find a representation $f_z$ of the molecular conformation that inverts, up to some equivalence class $\sim_C$ like permutations and scaling (see above), the ground truth generative model, $(f_z \circ g^*)(z^*) = z \sim_C z^*$.
A popular approach (see CryoDRGN tutorial) consists in fitting a nonlinear model to cryo-EM data (Figure 1B) followed by manual investigation of the learned latent space that represents conformational heterogeneity (Zhong et al., 2021a) (Figure 1C), thus limiting our ability to quantitatively compare models. Here, we propose a possible remedy in the shape of benchmarks where we simulate data using the generative model (Eq. 1) to assess how close different methods get to the correct (i.e., up to $\sim_C$) representation of conformational latent spaces. This taps into a rich, recent literature in nonlinear ICA methods (Hyvärinen et al., 2024), including benchmarks and metrics for model comparisons (Locatello et al., 2020a).
Once we have benchmarks and metrics, we can measure quantitative progress. However, none of the existing heterogeneous reconstruction approaches in computational cryo-EM are identifiable, mirroring the state of the disentangled representation machine learning field in 2019 (Locatello et al., 2019). To actually make progress, in this perspective, we propose three potential approaches to apply nonlinear ICA methods for the unsupervised discovery of molecular conformational changes:
1. Time-resolved single particle imaging. Observing conformational changes over time, such as a sparse change in a single conformational degree of freedom, provides valuable information; this relies on nonlinear ICA methods that use temporal autocorrelations of the sources (Section 4.2.1).

Disentangling pose and conformation
In this section we will first discuss the problem of separating pose and conformation in cryo-EM latent variable models. Recent experiments by Edelberg and Lederman (2023) demonstrated that this desired disentanglement is, unfortunately, violated in the case of CryoDRGN (Zhong et al., 2021a). We start by proposing more systematic evaluations and metrics to measure progress on this task (Section 3.1). Note that those are not metrics in a strict mathematical sense, but rather indices that allow us to measure progress. These metrics inspire simple supervised intervention experiments that can be executed in simulation and added to existing training pipelines to disentangle pose and conformation in cryo-EM latent spaces (Section 3.2).

Evaluating disentanglement of pose and conformation
We are interested in the cryo-EM ground truth generative function $g^*(z^*, \phi^*) = \pi_{\phi^*}(v^*_{z^*})$ (Eq. 1), which consists of a known pose function $\pi_{\phi^*} = \pi(\cdot, \phi^*)$, an unknown volume function $v^*_{z^*} = v^*(\cdot, z^*)$, an unknown pose $\phi^*$ and an unknown conformation $z^*$. Now, for any specific image, we have full control over the pose function $\pi_{\phi^*}$ but do not know the pose $\phi^*$; however, for the conformation we have neither control over the volume function $v^*_{z^*}$ nor knowledge of the true conformation $z^*$ (Shannon et al., 1959). Consequently, in this section and in Section 3.2 we leverage the fact that we have complete knowledge about the pose function to measure and constrain the flexibility of the conformation $z$ and volume function $v_z$ that we learn in our model.
Put simply, what we want is that, for a molecule with fixed conformation, our model predicts the same conformation even if we change the pose of the molecule. That is, we want the conformation representation $f_z$ to be invariant to pose changes. Additionally, we want the pose representation $f_\phi$ to be invariant to conformation changes. Mathematically, these invariance requirements must hold for all possible poses $\phi \in \Phi$ and conformations $z \in \mathcal{Z}$. When we train a model, this can go wrong either in our encoder $f(x)$ (if it fails to separate pose and conformation) or in our decoder $g(z, \phi) = \pi(v_z, \phi)$ (if the volume function $v_z$ learns to represent pose changes). Moreover, it may be necessary to add observation noise to the generated images, $g(z, \phi) + \epsilon$, to mitigate domain shift between the training data and these simulations. To measure progress on this challenge, we can turn these requirements into six different evaluation metrics, which we introduce in Appendix A. In practice, a single metric (Alg. 1, Eq. 2) seems to suffice, as we will discuss in the next section.
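One way to write the two invariance requirements stated above is given below. This is a reconstruction from the prose description, offered as a sketch and not as the original equation; the primed symbols $\phi'$ and $z'$ are introduced here only to denote a second, arbitrary pose and conformation.

```latex
\begin{aligned}
f_z\big(g(z, \phi)\big) &= f_z\big(g(z, \phi')\big)
  && \forall\, z \in \mathcal{Z},\ \phi, \phi' \in \Phi, \\
f_\phi\big(g(z, \phi)\big) &= f_\phi\big(g(z', \phi)\big)
  && \forall\, z, z' \in \mathcal{Z},\ \phi \in \Phi.
\end{aligned}
```

The first line says that the inferred conformation does not change when only the pose changes; the second says that the inferred pose does not change when only the conformation changes.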

Correcting disentanglement of pose and conformation
Algorithm 1.
1. Encode a batch $(x_i)_{i=1,\dots,N}$ of images into conformations and poses $(z_i, \phi_i)$.
2. Detach all $z_i$ and $\phi_i$ from the computation graph. (a)
3. Randomly shuffle the poses, $\phi'_i = \phi_{\sigma(i)}$. (b)
5. Measure the distances $d(\cdot, \cdot)$ to the original latents. (c)
6. Optimize the encoder and decoder $(f_{\theta_1}, g_{\theta_2})$ along the derivatives $(\partial L / \partial \theta_1, \partial L / \partial \theta_2)$.
7. Repeat 1. to 6. until convergence; or add $L(f_{\theta_1}, g_{\theta_2})$ to the total loss function in regular training.
(a) We treat these as given latents and do not differentiate with respect to their initial computation.
(b) Using a random permutation $\sigma$ instead of a perturbation $\delta_\phi$ ensures that we stay within the posterior distribution $p(\phi \mid x)$ of poses.
(c) In the conformation space, this could just be Euclidean; in pose space we would have to compute, e.g., the geodesic distance in $SO(3)$.
Based on the ideas proposed in the previous section and the metrics in App. 5.1, we are now going to propose a simple penalty term that can be added to existing cryo-EM models to disentangle pose and conformation. The logic behind these intervention experiments is illustrated in Figure 3. This procedure relies on the physics-based decoder $g$ with an explicit pose representation $\pi_\phi$. Typically, the representation $f_{\theta_1}$ and the generator $g_{\theta_2}$ are parameterized as neural networks with learnable parameters $\theta_1$ and $\theta_2$ (Kingma and Welling, 2013; Zhong et al., 2021a). Clearly, we can compute the gradients of all metrics with respect to those parameters. In practice, we observed that good results can be achieved simply by following Algorithm 1. Note that this is a straightforward, supervised learning objective that is a relatively standard problem in modern machine learning and should present little difficulty. Thus, we can just add $L(f_{\theta_1}, g_{\theta_2})$ (Algorithm 1) as an additional penalty term to the loss function of any of the existing models with separate pose and conformation representation to encourage disentanglement.
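The penalty computed in steps 1 to 5 of Algorithm 1 can be sketched in NumPy as follows. This is a hedged illustration, not the authors' implementation: `encode` and `decode` are hypothetical callables, poses are assumed to be stored as 3 × 3 rotation matrices, and step 4, which is not legible in the text, is assumed here to decode the detached conformations with the shuffled poses and re-encode the result. The gradient updates of steps 6 and 7 need an automatic differentiation framework and are omitted.

```python
import numpy as np

def geodesic_distance(r1, r2):
    """Rotation angle between batches of rotation matrices of shape (n, 3, 3)."""
    trace = np.einsum("nij,nij->n", r1, r2)               # tr(r1^T r2) per sample
    return np.arccos(np.clip((trace - 1.0) / 2.0, -1.0, 1.0))

def pose_shuffle_loss(encode, decode, images, rng):
    """Intervention penalty; `encode` and `decode` are hypothetical callables.

    encode(images) -> (z, rot) with shapes (n, k) and (n, 3, 3);
    decode(z, rot) -> images.
    """
    z, rot = encode(images)                                # 1. encode the batch
    z, rot = z.copy(), rot.copy()                          # 2. treat latents as fixed targets
    rot_shuffled = rot[rng.permutation(len(rot))]          # 3. shuffle poses within the batch
    z_new, rot_new = encode(decode(z, rot_shuffled))       # 4. (assumed) decode and re-encode
    loss_z = np.linalg.norm(z_new - z, axis=1).mean()      # 5. Euclidean distance for z
    loss_rot = geodesic_distance(rot_new, rot_shuffled).mean()  # geodesic distance for poses
    return loss_z + loss_rot
```

A model whose conformation encoder ignores pose returns a loss near zero, whereas a model that leaks pose information into $z$ is penalized, because the re-encoded conformation then changes when only the pose was swapped.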
Importantly, we are only able to write this approach in such a concise and easy form because of the physics-based decoder $g$. By this we mean the fact that we know the physics, i.e., the optics behind the projection $\pi_\phi$ in the image formation model (Eq. 1). We could imagine a different cryo-EM generative model where both the conformation and the pose change are modeled by an implicit volume representation $v$ defined on $\mathbb{R}^3 \times \mathcal{Z}'$ with some extended latent space $\mathcal{Z}'$. Or, in even more general terms, we could just train a standard VAE (Kingma and Welling, 2013) on cryo-EM images to learn a neural network encoder $f: \mathcal{X} \to \mathcal{Z}''$ and decoder $g: \mathcal{Z}'' \to \mathcal{X}$ back to image space with some, potentially even more abstract, latent space $\mathcal{Z}''$ (Miolane et al., 2020). However, such abstract models would not have the built-in physics of objects in space, their poses $\phi$ and their projections $\pi_\phi$ onto a two-dimensional image, which we assume a priori in our standard cryo-EM decoder (Eq. 1). In other words, such more abstract models would lack the architectural distinction between $z$ and $\phi$ which we need in our intervention experiment to disentangle pose and conformation. Thus, we would not be able to manipulate distinct parts of the extended latent spaces ($\mathcal{Z}'$ or $\mathcal{Z}''$), knowing that those represent distinct physical manipulations of the image.
For models using an implicit representation of the volume, the reason we have to use this interventional approach to disentangle pose and conformation in the first place is that the implicit representation $v(z)$ is a highly flexible neural network that can easily model pose changes (Sitzmann et al., 2020). Only by combining this with the constraints of physics (i.e., the image formation optics) are we able to disentangle the pose and conformation representations. This is akin to disentanglement approaches that use the assumption of sparse manipulations, i.e., pairs of data points where only subsets of the latents are modified (Locatello et al., 2020b). Those models have been demonstrated to solve the nonlinear ICA problem theoretically and practically. Thus, whenever we know something about the physics of the world, it makes our representation learning task much simpler if we can run intervention experiments that test the causal dependencies between our latent variables (Ahuja et al., 2023; Squires et al., 2023).
We performed a small proof-of-concept experiment to test these predictions and report the results in Figure 3. We train a standard VAE (with separate pose and implicit volume representation) on pseudo cryo-EM data and compare it to the same model but with the additional training step in Algorithm 1. We refer to that model as PoseVAE. In Figure 3B, we see that the additional penalty term in PoseVAE does, indeed, succeed at lowering the pose disentanglement metric (Eq. 2). Inspecting the latent representations (Figure 3E, middle), we observe that the pose is now fully confined to the pose variable. Moreover, we observe that the conformation space $z$ is, itself, becoming more disentangled (Figure 3E, right). Intuitively, this makes sense because less can go wrong now in encoding two instead of three variables into it. Quantitatively, this observation is confirmed by standard disentanglement metrics showing that PoseVAE achieves a higher mean correlation coefficient (MCC) both across all latents (Figure 3C, left) and within the conformation latents $z$ alone (Figure 3, right). This is encouraging for the next task of disentangling the conformations.

Disentangling conformations
Analogously to the previous section, in a second step, we propose a theoretical framework with metrics and benchmarks concerning the further disentanglement of the individual components inside the conformation vector $z$ (Figure 1). This addresses the essential challenge of interpretable cryo-EM conformational representations for heterogeneous reconstruction (Section 4.1). These benchmarks will help us measure true progress in the budding field of computational cryo-EM (June 2023: Cryo-EM Heterogeneity Challenge). Lastly, we discuss different methods to leverage recent developments in nonlinear ICA that have the potential to build the next generation of cryo-EM models that get closer to the true answer (Section 4.2). These models may require future technological advancements such as temperature-dependent cryo-EM (Bock and Grubmuller, 2022) or time-resolved single particle (X-ray) imaging (Shenoy et al., 2023a;b).

Evaluating disentanglement of independent factors of conformations
We consider the disentanglement of the individual components or dimensions inside the conformation space z. We propose a metric, in the form of a computational procedure, to evaluate whether the independent components of conformational variations are disentangled in the latent space. Essentially, this proposal relies on simulated data that exactly fulfills the generative model (Eq. 1) which we assume for cryo-EM data. Having full control over the generative model is important, not only to measure progress, but also to simulate extended datasets (e.g., time-resolved imaging), because we know that only those future datasets with additional assumptions will provably allow progress in disentangling conformations. Otherwise, this challenge is hopeless (Locatello et al., 2019). Thus, in the ideal case, we have access to a good cryo-EM simulator g*, so we can just use this.
However, if we do not have a good cryo-EM simulator, we can use the existing state-of-the-art model and see if we can recover its latents. More precisely, we can do the following: train a regular cryo-EM model with the additional training loss (Algorithm 1) that ensures that pose and conformation are disentangled. Then, check that the model is approximately invertible. This is a common assumption in ICA theory (Hyvärinen et al., 2024) to make sure that the task of recovering the sources is well-defined. For this, we want to make sure that no two distinct points in conformation latent space, z_1 ≠ z_2, lead to the same volume representation, v(z_1) = v(z_2). Once this is approximately validated, we can use the model as a cryo-EM simulator. Intuitively, we now treat this first model as the ground truth generator g*≔g and see if we can recover its latents z*≔z.
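The approximate invertibility check can be illustrated with a small sampling test. This is only a sketch under stated assumptions: `volume` and `collapsed_volume` are hypothetical stand-ins for a learned volume decoder v(z), and the number of sampled pairs is arbitrary. The function reports the smallest ratio between the distance of two decoded volumes and the distance of their latents; a ratio close to zero flags two distinct conformations that decode to (almost) the same volume.

```python
import math
import random


def volume(z):
    """Hypothetical stand-in for a learned volume decoder v(z).

    A real model would return voxel densities; this toy map is injective
    by construction because it copies both latents into the output.
    """
    z1, z2 = z
    return [z1, z2, math.tanh(z1 + z2)]


def collapsed_volume(z):
    """Non-invertible counterexample: the second latent is ignored."""
    z1, _ = z
    return [z1, 0.0, math.tanh(z1)]


def min_separation_ratio(decoder, n_pairs=2000, dim=2, seed=0):
    """Smallest ||v(za) - v(zb)|| / ||za - zb|| over random latent pairs."""
    rng = random.Random(seed)
    worst = math.inf
    for _ in range(n_pairs):
        za = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        zb = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        dz = math.dist(za, zb)
        if dz == 0.0:
            continue  # identical latents carry no information
        worst = min(worst, math.dist(decoder(za), decoder(zb)) / dz)
    return worst
```

A sampled ratio that stays well above zero is evidence for, not a proof of, invertibility, since only finitely many pairs are tested.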
The procedure to evaluate and benchmark heterogeneous cryo-EM latent variable models would then be to assess how well they learn the same (up to equivalences) conformation latent space as the original model. Thus, we would effectively sample a ground truth conformation z* and some random pose ϕ* and feed them into the ground truth model to obtain an image x = g*(z*, ϕ*) + ϵ. We then process this image with a candidate model f_z(x) = z to obtain the learned conformation representation z. Consequently, we have to compare the two vector representations z* and z. Depending on the equivalence class (~C) that we are interested in, there are many different metrics to assess how well z* is disentangled in z. Intuitively, we want a one-to-one correspondence between the two representations where changing a single entry in z* corresponds to changing a single entry in z, and vice versa. Fortunately, this problem has been studied extensively in the machine learning subfield of disentangled representation learning (Bengio et al., 2013), with many proposed metrics and standardized benchmarks (Locatello et al., 2020a). We can build on those advances to obtain better quantitative measures of progress in heterogeneous cryo-EM reconstruction than volume-based comparisons.
As an example metric measuring disentanglement, we will discuss the Mean Correlation Coefficient (MCC) (Hyvarinen and Morioka, 2017). Intuitively, we want each learned latent variable to be perfectly correlated (or anti-correlated, since sign flips do not compromise interpretability) with a single source variable.
To measure this, we can compute the (absolute) correlation coefficient between all ground truth latents z* and all learned latents z. To account for permutations, we have to solve a linear sum assignment with a permutation σ: {1, ..., K} → {1, ..., K}, which finds the best matching z*_σ(i) for each z_i. The MCC is then simply the mean over those matches, $\mathrm{MCC} = \max_{\sigma} \frac{1}{K} \sum_{i=1}^{K} \left| \mathrm{corr}\left(z_i, z^*_{\sigma(i)}\right) \right|$, with corr() denoting correlation. Other metrics focus on decodability or informational independence (Locatello et al., 2019), and there is no consensus on the optimal disentanglement metric. Thus, we simply report scores across all metrics; these can be further grouped by rank ordering to get overall model comparison scores (see Klindt et al., 2020).
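The MCC computation described above can be sketched in a few lines. This is an illustrative implementation, not the evaluation code used for the figures: the function names are our own, and the permutation is found by brute force, which is only practical for a handful of latent dimensions. For larger K, a linear sum assignment solver such as `scipy.optimize.linear_sum_assignment` would be the usual choice.

```python
import itertools
import math
import random


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def mean_correlation_coefficient(z_true, z_learned):
    """MCC between ground-truth and learned latents.

    Both arguments are lists of K columns (one list of samples per latent
    dimension). Returns the best mean absolute correlation over all
    permutations and the permutation sigma that achieves it, where
    z_learned[i] is matched to z_true[sigma[i]].
    """
    k = len(z_true)
    abs_corr = [[abs(pearson(z_learned[i], z_true[j])) for j in range(k)]
                for i in range(k)]
    best_score, best_sigma = -1.0, None
    for sigma in itertools.permutations(range(k)):
        score = sum(abs_corr[i][sigma[i]] for i in range(k)) / k
        if score > best_score:
            best_score, best_sigma = score, sigma
    return best_score, best_sigma
```

Because absolute correlations are used, a learned representation that is a permuted and sign-flipped copy of the sources still scores an MCC of one, whereas a rotated mixture of two sources scores lower.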

Correcting disentanglement of independent factors of conformations
Let us now discuss the hardest task, i.e., finding the independent degrees of freedom that determine the conformation of a molecule (Figure 1). This is a hard problem in the sense that, for a nonlinear function g (Eq. 1) and without any additional assumptions, it has been known for the last two decades to be practically impossible (Hyvärinen and Pajunen, 1999). Moreover, the field of disentangled representation learning (Bengio et al., 2013) has spent multiple years proposing methods that were ultimately unidentifiable (Locatello et al., 2020a). Going forward, computational cryo-EM should learn from those lessons and avoid the same pitfalls. As a very basic example, if our conformation latent space has an isotropic Gaussian prior, as in standard VAEs (Kingma and Welling, 2013), we can always perform a random rotation on the learned latents without changing the likelihood of the model (Hauberg, 2018). Thus, any direction in latent space may represent the actual isolated change in conformation of the molecule. Fortunately, recent years have seen the development of different methods that solve the problem of nonlinear ICA (Hyvärinen et al., 2024). Below, we propose different approaches that are within technological reach (Section 4.2.1), that make additional statistical assumptions that fit cryo-EM data (Section 4.2.2), or that integrate additional physical knowledge to constrain the problem (Section 4.2.3).
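The rotation ambiguity of a standard Gaussian prior can be checked numerically. The following sketch only verifies the prior term: it confirms that rotating a 2D latent vector leaves its log-density under a standard normal unchanged. The helper names are our own. The remaining part of the argument, that a flexible decoder can absorb the inverse rotation so that the data likelihood is also unchanged, is not simulated here.

```python
import math
import random


def standard_normal_logpdf(z):
    """Log-density of a standard (zero-mean, unit-variance) Gaussian at z."""
    k = len(z)
    return -0.5 * sum(v * v for v in z) - 0.5 * k * math.log(2.0 * math.pi)


def rotate_2d(z, angle):
    """Rotate a 2D latent vector by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [c * z[0] - s * z[1], s * z[0] + c * z[1]]


def max_logpdf_change_under_rotation(n_points=1000, seed=0):
    """Largest change in prior log-density caused by a random rotation."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(n_points):
        z = [rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]
        angle = rng.uniform(0.0, 2.0 * math.pi)
        diff = (standard_normal_logpdf(rotate_2d(z, angle))
                - standard_normal_logpdf(z))
        worst = max(worst, abs(diff))
    return worst
```

The change is zero up to floating-point error because the density depends on z only through its squared norm, which rotations preserve.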

Time-resolved single particle imaging
If we had temporal data of conformation changes for the same molecule over time, we could start applying nonlinear ICA methods that depend on temporal autocorrelations of the sources (Hyvarinen and Morioka, 2017; Hälvä et al., 2021). Specifically, these methods operate under the assumption that we are able to record a time series in which the transitions of each latent component i are sparse, e.g., $z^{(i)}_{t+1} - z^{(i)}_t \sim \mathrm{Lap}(\mu = 0, \lambda > 0)$, and the transitions are independent between components, i.e., $p(z_{t+1} \mid z_t) = \prod_i p(z^{(i)}_{t+1} \mid z^{(i)}_t)$. Klindt et al. (2020) showed that those assumptions are often verified on natural video data, which is important since making additional statistical assumptions to obtain identifiable models is only useful if those assumptions are actually aligned with the statistics of real-world data.
Practically, such a model leads to a minimal modification of standard VAE training, where the temporal difference of latents is now also penalized to follow the specified transition distribution. However, the crucial difference in this learning paradigm is having access to temporal data (x_t, x_t+1). While this is not routinely feasible experimentally, efforts to develop time-resolved cryo-EM (Mäeots and Enchev, 2022; Lorenz, 2024) will eventually enable the direct observation of protein dynamics in the microseconds to seconds range, yielding datasets where each particle image is associated with a timestamp that can be readily deployed in the modified modeling approach above.
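A simplified version of the temporal penalty can be written as the negative log-likelihood of the latent differences under independent, zero-mean Laplace transitions. This sketch is an assumption-laden illustration: the function names are hypothetical, the Laplace density is parameterized by a rate as p(d) = (rate / 2) exp(-rate |d|), and the penalty acts on point estimates of the latents, whereas SlowVAE (Klindt et al., 2020) replaces the prior of the second time step by a Laplace distribution centered on the posterior mean of the first.

```python
import math
import random


def sample_laplace(rng, rate):
    """Draw one sample from a zero-mean Laplace distribution."""
    # The difference of two i.i.d. exponentials with equal rate is Laplace.
    return rng.expovariate(rate) - rng.expovariate(rate)


def laplace_transition_nll(z_t, z_next, rate):
    """Negative log-likelihood of z_next given z_t under independent
    Laplace transitions with density p(d) = rate / 2 * exp(-rate * |d|)."""
    return sum(rate * abs(b - a) - math.log(rate / 2.0)
               for a, b in zip(z_t, z_next))


def simulate_pair(rng, dim=2, rate=1.0):
    """Sample a latent pair (z_t, z_t+1) with sparse Laplace transitions."""
    z_t = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    z_next = [z + sample_laplace(rng, rate) for z in z_t]
    return z_t, z_next
```

In a training loop, a term of this kind would be evaluated on the encoded latents of consecutive images and added to the usual VAE objective.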
We performed these disentanglement experiments in Figure 4. The first row shows a demonstration of sparse transitions (drawn from a Laplace distribution) that change some of the latent variables (z_0, z_1, ϕ). Below, we trained N = 50 models with and without temporal prior (Klindt et al., 2020) and measured seven typical disentanglement metrics (Locatello et al., 2019). We observe that in six out of seven of those metrics, the model with temporal prior, SlowVAE, does indeed achieve higher disentanglement scores. Further improvements could be achieved by including the pose loss from the previous section. However, this is already a promising proof-of-concept for future disentanglement of conformation latents based on temporal data.

Controlling the Boltzmann distribution
The idea above applies to time-resolved experiments studying transient dynamics, triggered by some process such as mixing with a ligand or light excitation. Another class of experiments is concerned with steady-state dynamics where timestamp labels are not helpful. For those experiments, a potentially useful knob that could help solve the nonlinear ICA problem is knowledge of the temperature associated with each particle in the dataset. The effect of cooling has been reviewed (Bock and Grubmuller, 2022) and different studies have used temperature to change the conformation distribution to obtain insights (Chen et al., 2019; Mehra et al., 2020). Experimentally, this could be achieved by freezing grids with different cryogens, although a preferable approach would follow the development of thermochromic molecular probes able to report on the local temperature on the cryo-EM grid (Kortekaas and Browne, 2019). This way, precise temperature labeling of each particle in the dataset could be achieved.
Formally, manipulating the temperature τ would provide us with control over the Boltzmann distribution of molecular conformations, $p(z \mid \tau) = \frac{1}{Q(\tau)} \exp\left(-\frac{\varepsilon(z)}{k_B \tau}\right)$, with Q(τ) the canonical partition function, ε(z) the energy of being in conformation z, and k_B the Boltzmann constant. This conditional distribution, where we assume knowledge or experimental control over the temperature, maps onto the theoretical framework of iVAE (Khemakhem et al., 2020) with u = τ. Future theoretical investigations are needed to verify whether the additional assumptions for their identifiability results are fulfilled in this setting. However, to build intuition, we can walk through a thought experiment to see how control of the temperature can suffice to discover the independent degrees of freedom in molecular conformation changes (Figure 5A). Assume, again, a molecule with two degrees of freedom $z_1, z_2 \in \mathbb{R}$ that both follow temperature-dependent normal distributions, $z_i \mid \tau \sim \mathcal{N}(\mu_i, \sigma_i^2(\tau))$ for $i \in \{1, 2\}$. Now, suppose that at low temperature $\tau_a$, we only see variation in the first component $z_1$ while the second component is nearly constant, i.e., $\sigma_2^2(\tau_a) \ll \sigma_1^2(\tau_a)$. By contrast, at high temperature $\tau_b > \tau_a$, we see that the second component also starts moving, i.e., $\sigma_2^2(\tau_b) \gg 0$. Thus, using temperature alone, we can successfully isolate the different degrees of freedom. Intuitively, this should make it possible to solve the disentanglement task. We could simply fit a model to the data at temperature $\tau_b$ with the additional constraint that the same model also has to be able to encode the data at temperature $\tau_a$, albeit with only the first latent dimension $z_1$. Whether those assumptions bear out in real molecules is not yet clear; still, recording at different temperatures is within closer technological reach than time-resolved SPI.
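The thought experiment can be made concrete with a toy simulation. Everything below is an assumption made for illustration: the two degrees of freedom are mixed by a plain 2D rotation instead of a nonlinear decoder, the standard deviations at the low temperature are invented values, and the function names are hypothetical. The sketch shows that when the second component is nearly frozen, the direction of largest variance in the low-temperature data recovers the axis of the first component.

```python
import math
import random


def sample_observations(rng, sigma1, sigma2, mixing_angle, n):
    """Sample z1, z2 ~ N(0, sigma_i^2) and return rotated observations.

    The rotation stands in for an unknown mixing of the two degrees of
    freedom; a real cryo-EM decoder would be nonlinear.
    """
    c, s = math.cos(mixing_angle), math.sin(mixing_angle)
    points = []
    for _ in range(n):
        z1, z2 = rng.gauss(0.0, sigma1), rng.gauss(0.0, sigma2)
        points.append((c * z1 - s * z2, s * z1 + c * z2))
    return points


def principal_axis_angle(points):
    """Angle (radians) of the direction of largest variance in 2D."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    cxx = sum((p[0] - mx) ** 2 for p in points) / n
    cyy = sum((p[1] - my) ** 2 for p in points) / n
    cxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    return 0.5 * math.atan2(2.0 * cxy, cxx - cyy)


def recover_first_direction(mixing_angle=math.radians(30), seed=0):
    """Estimate the z1 direction from low-temperature data alone."""
    rng = random.Random(seed)
    # Assumed low-temperature regime: z2 is nearly frozen (sigma2 << sigma1).
    cold = sample_observations(rng, sigma1=1.0, sigma2=0.05,
                               mixing_angle=mixing_angle, n=5000)
    return principal_axis_angle(cold)
```

With equal variances in both components, the principal axis would be undetermined, which is the rotation ambiguity that the temperature label is meant to remove.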

Atomic models
While the previous two proposals require different data, we may also hope to make progress with different models. We saw how the implicit volume representation needs additional care to disentangle pose and conformation information (Section 3.1). Imbuing the generative model with physics-inspired structure allowed us to separate pose from conformation (Section 3.2). Perhaps even more physics can help us solve the harder problem of finding the conformational degrees of freedom. In particular, if we replace the highly expressive implicit volume representation v with an atomic model (Zhong et al., 2021b; Rosenbaum et al., 2021; Nashed et al., 2022; Koo et al., 2023), then the conformation latent variable z will have to encode how an atomic reference structure (maybe the mode of the conformation space) is deformed into an observed conformation (Figure 5B).
One existing work by Punjani and Fleet (2021) proposes to learn a convection field that deforms a reference volume. This comes with the elegant property of volume preservation, which is not always guaranteed in implicit conformation representations but is consistent with our knowledge of the underlying physics. However, the learned convection fields as well as the reference volume model are still over-parameterized compared to an ideal atomistic model with movement vectors for each atom. The problem in building smaller and more constrained models is that modern machine learning methods have, to some extent, proven so powerful because they allow heavily overparameterized hypothesis classes that still generalize well beyond the training data. The question then becomes whether we can combine the best of both worlds, i.e., the non-convex optimization and generalization properties of deep neural networks (e.g., for implicit volume or convection representations) with the physical detail of constrained atomistic models (Zhong et al., 2021b; Rosenbaum et al., 2021; Nashed et al., 2022; Koo et al., 2023).
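To make the idea of an atomistic decoder with per-atom movement vectors concrete, the following toy sketch deforms a reference structure by a linear combination of displacement modes, one mode per latent dimension. This is a deliberately simple assumption for illustration, not a description of any of the cited models: the function name, the linear form, and the three-atom example are all hypothetical.

```python
def deform(reference, modes, z):
    """Deform a reference structure with per-atom displacement vectors.

    reference: list of (x, y, z) atom coordinates.
    modes: for each latent dimension, a list with one (dx, dy, dz)
        displacement vector per atom (hypothetical linear modes).
    z: conformation latent, one coefficient per mode.
    """
    atoms = []
    for a, ref in enumerate(reference):
        atoms.append(tuple(
            ref[c] + sum(z[k] * modes[k][a][c] for k in range(len(z)))
            for c in range(3)))
    return atoms


# Toy "molecule": three atoms on a line, with two arms that can move
# independently in the y direction.
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
modes = [
    [(0.0, 1.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 0.0)],  # left arm
    [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 1.0, 0.0)],  # right arm
]
left_arm_raised = deform(reference, modes, [0.5, 0.0])
```

In such a parameterization, each latent coordinate has a direct physical reading as the amplitude of one motion, which is the kind of interpretability an implicit volume representation does not provide by itself.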

Discussion
In recent years, the integration of computational models, particularly VAEs (Kingma and Welling, 2013), has revolutionized research across various natural sciences, including cryo-EM (Zhong et al., 2021a). This perspective piece underscores the critical importance of understanding conformational latent spaces in cryo-EM by drawing on cutting-edge theoretical advancements in identifiable nonlinear ICA (Hyvärinen et al., 2024). By bridging the gap between theoretical frameworks and practical applications in cryo-EM, we are suggesting a significant advance in the interpretability and utility of latent variable models. Furthermore, our study advocates for the adoption of better quantitative measures to assess progress in heterogeneous cryo-EM reconstruction, transcending traditional volume-based comparisons. This aligns with recent initiatives such as the Cryo-EM Heterogeneity Challenge, emphasizing the need for refined evaluation metrics to accurately gauge advancements in this field. Nevertheless, this work is merely an opinion piece and proof-of-concept demonstration. Significant technical (e.g., time-resolved SPI) and engineering challenges (e.g., identifiable nonlinear ICA models that work in low SNR regimes) lie ahead on this path towards interpretable cryo-EM conformation spaces.
One of the key limitations of the approach that we are proposing in this paper is the assumption that there exists a limited number of factors of variation that determine the possible conformations and conformation changes out of all the possible rearrangements of constituent atoms. Moreover, we assume that those factors are independent, which means that the molecule has parts that move independently of each other. Prior work in nonlinear ICA has considered the effect of dependencies (Träuble et al., 2021) and how to mitigate them, but that might not even be necessary. It is conceivable that, for instance in the cartoon in Figure 1, both arms of the molecule always move up and down together so that they are not independent. However, if both arms always move together, then this likely fulfills some biological function. In this scenario, our model would presumably learn to describe the combined motion by a single latent variable which would, thus, represent the motion required to perform this biological function. Consequently, such dependencies might unveil (more complex) molecular motions and biological functions that we can then extract.
The history of ICA's emergence in the 1990s, and in particular its early adoption in neuroimaging (McKeown et al., 1998), shows its capacity to evolve into a cornerstone of data-based modeling. This trend, moving even further away from hypothesis-driven research (Friston, 1998) toward data-centric approaches (Beckmann et al., 2005), also underscores the importance of incorporating principles like nonlinear ICA to ensure meaningful model outputs. While modern machine learning techniques, including VAEs or nonlinear dimensionality reduction methods such as t-SNE (Van der Maaten and Hinton, 2008) or UMAP (McInnes et al., 2018), have become ubiquitous in data-based modeling, they often overlook source recovery, i.e., identifiability considerations. To fully harness the potential of latent spaces, it is paramount to ensure their alignment with meaningful representations of the underlying data.
In conclusion, our approach integrates nonlinear ICA principles into the development and analysis of cryo-EM latent variable models, ensuring more interpretable representations that encapsulate the intrinsic structure of the data. Unlocking latent spaces aligned with the underlying fundamental factors governing complex phenomena is pivotal for gaining deep insights into biological processes, expediting drug discovery, and facilitating targeted interventions. This progress extends beyond cryo-EM, resonating with diverse scientific disciplines such as computer vision, natural language processing, and generative modeling, where (VAE) latent spaces play a pivotal role in data representation and the generation of new scientific hypotheses as part of initiatives such as AI4Science. Our interdisciplinary approach, embracing nonlinear ICA and disentanglement models, holds promise in generating meaningful representations that carve nature at the joints, thereby propelling transformative discoveries.
where, ideally, one would use the oracle encoder f*(x) = argmax_{z,ϕ} p(x|z, ϕ) p(z, ϕ). In practice, we can just use the current encoder f(x) = (f_z(x), f_ϕ(x)) and optimize it as well. Again, this can be split into two metrics (consistency and invariance), both for the conformation encoder f_z
1. Conformation-encoder consistency, i.e., how accurately the conformation encoder f_z recovers any conformation, independent of pose.
2. Conformation-encoder pose invariance, i.e., how invariant the conformation encoder f_z is to pose perturbations.
as well as for the pose encoder f_ϕ
1. Pose-encoder consistency, i.e., how accurately the pose encoder f_ϕ recovers any pose, independent of conformation.
2. Pose-encoder conformation invariance, i.e., how invariant the pose encoder f_ϕ is to conformation perturbations.
where, again, ideally we would like to use the ground truth generator g*, but we can also just use the current learned decoder. All of these six metrics can be evaluated, in a supervised way, over a sufficient number of randomly sampled conformations z, poses ϕ and perturbations (δ_z, δ_ϕ).
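The consistency and invariance metrics for the conformation encoder can be illustrated on a toy generator. The squared-error form used below is an assumption made for this sketch, as are the toy decoder, both encoders, and the sampling ranges. The decoder places a point at radius z and angle ϕ; a pose-invariant encoder reads off the radius, while an entangled encoder reads off the first coordinate, which mixes conformation and pose.

```python
import math
import random


def decoder(z, phi):
    """Toy generator g(z, phi): a point at radius z > 0 and angle phi."""
    return (z * math.cos(phi), z * math.sin(phi))


def encode_conformation(x):
    """Pose-invariant conformation encoder f_z: the radius of the image."""
    return math.hypot(x[0], x[1])


def encode_conformation_entangled(x):
    """Entangled alternative: the first coordinate also depends on pose."""
    return x[0]


def conformation_metrics(f_z, n=2000, seed=0):
    """Mean squared consistency and pose-invariance errors of f_z."""
    rng = random.Random(seed)
    consistency = invariance = 0.0
    for _ in range(n):
        z = rng.uniform(0.5, 2.0)
        phi = rng.uniform(-math.pi, math.pi)
        delta_phi = rng.gauss(0.0, 0.5)  # random pose perturbation
        z_hat = f_z(decoder(z, phi))
        z_hat_perturbed = f_z(decoder(z, phi + delta_phi))
        consistency += (z_hat - z) ** 2                # recover z?
        invariance += (z_hat_perturbed - z_hat) ** 2   # ignore pose?
    return consistency / n, invariance / n
```

The pose-invariant encoder scores essentially zero on both errors, whereas the entangled encoder is penalized by both. The analogous metrics for the pose encoder would swap the roles of z and ϕ.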

FIGURE 2 (C) Negentropy (i.e., reverse entropy) of the first three principal and independent components. (D) (resp. (E)): (top) histogram of the projection of all image particle parameters on the first three principal (resp. independent) components PC1, PC2 and PC3 (resp. IC1, IC2 and IC3). (bottom) 2D histograms of the projections of all image particle parameters on all pairs of the first three principal (resp. independent) components. (F) (resp. (G)): maps illustrating the motions carried by IC1 (resp. IC2). (top) map reconstructed from the particles whose projections belong to the last bin along IC1 (resp. IC2). (bottom) map reconstructed from the particles whose projections belong to the first bin along IC1 (resp. IC2). All maps are shown overlaid on the consensus map, with threshold set at a lower density value, colored according to the scheme in (B).

FIGURE 3 Physics-based Pose and Conformation Disentanglement. (A) We perform an intervention, i.e., changing only the pose (conformation) of a latent pair; these changed latents are decoded and encoded again to measure the consistency and invariance of our model. (B) Compared to a vanilla VAE, a model PoseVAE trained with interventions (i.e., Alg. 1) achieves lower reconstruction error, lower KL divergence to the prior and a lower pose disentanglement loss (Eq. 2). (C) Disentanglement, measured as mean correlation coefficient (MCC; Hyvarinen and Morioka, 2017), increases not only between pose and conformation variables (left), but also among the conformation variables (right). (D) Visualizations of the learned latents for the vanilla VAE model, showing that the learned angle is not perfectly representing the true angle (plots one and two from the left); the three plots on the right show the learned conformation latents, representing mixtures of true conformation and pose. (E) Same as (D) but for PoseVAE, showing a perfect monotonic relationship between learned and true angle; also the conformation latents contain little information about the true angle (third plot) and disentangle the true conformation variables up to a 45° rotation.

FIGURE 4 Temporal Conformation Disentanglement. (A) Schematic of temporal data pairs (x_t, x_t+1) and minimal changes to VAE training. To obtain SlowVAE, all we need to do is change the conformation prior for the second time step to be a Laplace distribution centered around the posterior mean of the previous time step (Klindt et al., 2020). (B) Example temporal transitions in the two conformation (z_0, z_1) and one pose ϕ latents, drawn from a Laplace distribution (default parameters from Klindt et al. (2020): data rate λ = 1, VAE (β = 1, γ = 10), VAE prior rate λ′ = 10). (C) Most common disentanglement metrics; for details see Locatello et al. (2019). In seven out of eight metrics we see that SlowVAE learns a more disentangled representation of the conformation latents than a regular VAE without any adaptations.

4. Decode and encode the new conformation and pose pairs into

Algorithm 1
Interventions for Pose and Conformation Disentanglement.

TABLE 1
Glossary. Whenever a distinction is necessary in a given context, we use an asterisk (e.g., g*) to highlight that we are referring to the ground truth model (g*) or ground truth latent variables (z*, ϕ*). For instance, we have the ground truth generator g* of the data, in contrast to the learned generator g (i.e., decoder) from our model of the data.